Conversation
Pull request overview
Adds benchmark/training convenience scripts and configuration tweaks for MI355X DeepSeek pretraining runs, plus enhanced training-log memory reporting and some workflow/build adjustments.
Changes:
- Add new launcher scripts for DeepSeek V3 and DeepSeek V2 Lite training, and a C4 data preparation helper.
- Extend Megatron training-log patching to report the rank with max ROCm memory usage.
- Update Slurm/docker launch behavior and Megatron prepare/build flags; adjust DeepSeek V3 MI355X config defaults; comment out CI unit-test jobs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| start_training_dsv3.sh | New MI355X DeepSeek V3 Slurm launch script with env defaults and run flags. |
| start_training_dsv2_lite.sh | New MI355X DeepSeek V2 Lite Slurm launch script with env defaults and run flags. |
| prepare_c4_data.sh | New helper to merge pre-downloaded C4 shards and tokenize for Megatron. |
| primus/backends/megatron/patches/training_log/print_rank_last_patches.py | Add distributed aggregation to log the max ROCm memory usage rank. |
| examples/run_slurm_pretrain.sh | Allow overriding srun time/partition via env. |
| examples/run_pretrain.sh | Update AINIC NCCL IB TC/FIFO TC env settings. |
| examples/run_local_pretrain.sh | Prefer docker over podman and pull image before running. |
| examples/megatron/prepare.py | Install Emerging-Optimizers with --no-build-isolation. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Change PP default and add Turbo/Deepep-related overrides. |
| .github/workflows/ci.yaml | Comment out torch/jax unit-test jobs. |
examples/run_local_pretrain.sh (outdated)
```bash
docker_podman_proxy pull "$DOCKER_IMAGE"
```
Unconditionally pulling the image before running means this script will now fail in offline/air-gapped environments (or when using a locally-built image tag) even if the image is already present locally. Consider making the pull optional (e.g., behind a PULL_DOCKER_IMAGE flag) or using a runtime option like "--pull=missing" where supported.
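One possible shape for the gated pull, sketched below; `PULL_DOCKER_IMAGE` and the `maybe_pull_image` wrapper are illustrative names, and `docker_podman_proxy` is assumed to behave like the helper this script already defines:

```shell
#!/usr/bin/env bash
# Sketch: make the pre-run image pull opt-out and skip it when the image
# already exists locally. Flag name and defaults are assumptions.
PULL_DOCKER_IMAGE=${PULL_DOCKER_IMAGE:-1}

maybe_pull_image() {
    local image="$1"
    if [ "$PULL_DOCKER_IMAGE" != "1" ]; then
        echo "Skipping pull of $image (PULL_DOCKER_IMAGE=$PULL_DOCKER_IMAGE)"
        return 0
    fi
    # Only pull when the image is not already present locally.
    if docker_podman_proxy image inspect "$image" >/dev/null 2>&1; then
        echo "Image $image already present; skipping pull"
        return 0
    fi
    docker_podman_proxy pull "$image"
}
```

This keeps the current default behavior (pull on every run) while letting offline or locally-built-image users opt out with `PULL_DOCKER_IMAGE=0`.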
```bash
export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1
```
CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.
```diff
-export CLEAN_DOCKER_CONTAINER=1
+# Set to 1 to allow run_slurm_pretrain.sh to clean up all Docker/Podman containers on the host.
+# Use 1 only on dedicated/non-shared nodes where this is safe.
+export CLEAN_DOCKER_CONTAINER=0
```
start_training_dsv3.sh (outdated)
```bash
export PRIMUS_USER=tas
# export PRIMUS_TOKENIZED_DATA_PATH=/shared_aig/c4/tokenized/c4_en_train_text_document # this is the tokenized data path for the training
export PRIMUS_EXP_NAME=dsv3-pretrain-mbs_$MBS-gbs_$GBS-PP_$PRIMUS_PP-EP_$PRIMUS_EP-VPP_$PRIMUS_VPP-turbodeepep_$TURBO_DEEPEEP-legacygg_$LEGACY_GG-profile_$PROFILE
export PRIMUS_EXP_NAME=debugv3
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overrides the descriptive name built from the run parameters, so logs/output will always go to the hard-coded "debugv3" directory. Remove the second export or gate it behind an explicit debug flag to avoid accidental output collisions.
```diff
-export PRIMUS_EXP_NAME=debugv3
+: "${PRIMUS_DEBUG_EXP_NAME:=0}"
+if [ "$PRIMUS_DEBUG_EXP_NAME" = "1" ]; then
+    export PRIMUS_EXP_NAME=debugv3
+fi
```
```bash
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
TOKENIZER_TYPE="DeepSeekV3Tokenizer"
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)}     # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"}  # Set your HuggingFace token
```
The header comment says the script "downloads" C4 shards and that the default is 1 shard, but the implementation explicitly skips downloading and defaults NUM_SHARDS to 200. Please align the documentation and defaults with the actual behavior (either implement download, or update comments and set NUM_SHARDS default accordingly).
```bash
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done
```
This merge step appends directly into the final JSONL_FILE with '>>'. If the script is interrupted or a shard is corrupt, you'll end up with a partial JSONL that then causes future runs to skip merging because the file exists. Write to a temporary file and atomically move it into place on success (and/or validate the output) to avoid leaving a bad cached artifact.
```diff
-for i in $(seq 0 $((NUM_SHARDS - 1))); do
-    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
-    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
-    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
-    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
-done
+# Write to a temporary file first to avoid leaving a corrupted final JSONL
+TMP_JSONL_FILE="$(mktemp "${JSONL_DIR}/c4_en_train.jsonl.tmp.XXXXXX")"
+# Ensure the temporary file is cleaned up on failure or interruption
+cleanup_tmp() {
+    if [ -n "${TMP_JSONL_FILE:-}" ] && [ -f "${TMP_JSONL_FILE}" ]; then
+        rm -f "${TMP_JSONL_FILE}"
+    fi
+}
+trap cleanup_tmp EXIT INT TERM
+for i in $(seq 0 $((NUM_SHARDS - 1))); do
+    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
+    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
+    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
+    zcat "${SHARD_PATH}" >> "${TMP_JSONL_FILE}"
+done
+# Basic validation: ensure the merged file is non-empty before finalizing
+if [ ! -s "${TMP_JSONL_FILE}" ]; then
+    echo "ERROR: Merged JSONL is empty; aborting."
+    cleanup_tmp
+    exit 1
+fi
+# Move the completed temp file into place atomically
+mv "${TMP_JSONL_FILE}" "${JSONL_FILE}"
+# Prevent trap from deleting the now-final JSONL file
+TMP_JSONL_FILE=""
+trap - EXIT INT TERM
```
```bash
export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1
```
CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.
```diff
-export CLEAN_DOCKER_CONTAINER=1
+export CLEAN_DOCKER_CONTAINER=0  # Set to 1 to forcibly remove all Docker/Podman containers; use with caution on shared hosts
```
```python
r_ratio = r_used / r_total

# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
```
usage_tensor is created as float32 even though r_used is a byte count on the order of 1e11 for large GPUs. float32 will lose integer precision at that scale, which can make max-rank selection and reported GB values slightly inaccurate. Prefer using an integer dtype (e.g., int64) for the gathered byte counts, then format as GB on CPU.
```diff
-usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
+usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)
```
```bash
export HF_TOKEN="your_hf_token"             # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
```
This script exports placeholder values for HF_TOKEN and WANDB_API_KEY, which will overwrite any real credentials already set in the environment and can cause training/CI runs to fail unless the file is edited. Prefer reading from the environment and erroring if unset (or only setting defaults when variables are empty), rather than hard-coding placeholder strings.
```diff
-export HF_TOKEN="your_hf_token"             # make it your own hf token
-export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
+# Require HF_TOKEN and WANDB_API_KEY to be provided via the environment.
+if [ -z "${HF_TOKEN:-}" ]; then
+    echo "Error: HF_TOKEN environment variable is not set. Please export HF_TOKEN before running this script." >&2
+    exit 1
+fi
+if [ -z "${WANDB_API_KEY:-}" ]; then
+    echo "Error: WANDB_API_KEY environment variable is not set. Please export WANDB_API_KEY before running this script." >&2
+    exit 1
+fi
+export HF_TOKEN
+export WANDB_API_KEY
```
```bash
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```
PRIMUS_PATH defaults to a user-specific absolute path (/shared/john/Primus), which makes this script non-portable and likely to fail for other users/environments. Consider requiring PRIMUS_PATH to be provided (and exit with a clear message if unset) or deriving it relative to the repo root.
```diff
-NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
-DATA_DIR=${DATA_DIR:-"/shared/c4"}
-PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
+SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
+NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
+DATA_DIR=${DATA_DIR:-"/shared/c4"}
+PRIMUS_PATH=${PRIMUS_PATH:-"${SCRIPT_DIR}/../Primus"}
+if [[ ! -d "$PRIMUS_PATH" ]]; then
+    echo "Error: PRIMUS_PATH is not set to a valid directory: '$PRIMUS_PATH'" >&2
+    echo "Please set PRIMUS_PATH explicitly, for example:" >&2
+    echo "    export PRIMUS_PATH=/path/to/Primus" >&2
+    exit 1
+fi
```
```yaml
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
```
The unit test jobs for both torch and jax are commented out, which removes automated test execution from CI and can allow regressions (including in this PR) to merge unnoticed. If this is temporary, consider gating them behind a workflow input/label, reducing the test scope, or moving them to a separate workflow instead of disabling them entirely.
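If the disablement is meant to be temporary, one way to gate rather than delete is an `if:` condition on the job; the label name and `workflow_dispatch` input below are illustrative, not part of this repo:

```yaml
# Sketch: keep the job defined but skip it unless explicitly opted in.
# "run-unittests" (label) and "run_unittests" (input) are assumed names.
on:
  pull_request:
  workflow_dispatch:
    inputs:
      run_unittests:
        description: "Run torch/jax unit tests"
        type: boolean
        default: false

jobs:
  run-unittest-torch:
    if: >-
      contains(github.event.pull_request.labels.*.name, 'run-unittests') ||
      (github.event_name == 'workflow_dispatch' && inputs.run_unittests)
    needs: [code-lint]
    runs-on: [primus-lm-cicd-torch-j8knc]
    steps:
      - run: echo "Unit tests enabled for this run."
```

This keeps the job visible in the workflow (and trivially re-enabled) instead of relying on someone remembering to uncomment a 50-line block.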
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (2)
prepare_c4_data.sh:30
`HF_TOKEN` is defined here but is never used by this script (Step 1 explicitly skips download, and Step 2 calls preprocess_data.py without passing auth). Either remove this variable or use it in the download/auth flow to avoid implying it has an effect.
```bash
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)}     # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"}  # Set your HuggingFace token
```
prepare_c4_data.sh:139
- The summary message references `run_dsv3.sh`, but this repo's training entrypoints appear to be `start_training_dsv3.sh` / `examples/run_slurm_pretrain.sh`. Consider updating the wording to point to the actual script(s) users should run to avoid confusion.
```bash
echo "To use this data for training, set in run_dsv3.sh:"
echo ""
echo "  1. Change: --mock_data True → --mock_data False"
echo "  2. Add env: export PRIMUS_TOKENIZED_DATA_PATH=${TRAIN_OUTPUT_PREFIX}_text_document"
```
```bash
--time="${SLURM_TIME:-07:00:00}" \
--nodelist="${SLURM_NODELIST:-}" \
--partition="${SLURM_PARTITION:-amd-aig}" \
```
Passing --nodelist with an empty value (when SLURM_NODELIST is unset) causes srun to fail with an invalid nodelist. Consider only adding --nodelist when SLURM_NODELIST is non-empty (or default to omitting the flag entirely).
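A common bash pattern for this is to collect the srun arguments in an array and append `--nodelist` only when set; the `build_srun_args` wrapper below is illustrative, while the variable names follow the snippet above:

```shell
#!/usr/bin/env bash
# Sketch: only pass --nodelist when SLURM_NODELIST is non-empty,
# so srun never sees --nodelist="" (an invalid nodelist).
build_srun_args() {
    SRUN_ARGS=(--time="${SLURM_TIME:-07:00:00}"
               --partition="${SLURM_PARTITION:-amd-aig}")
    if [ -n "${SLURM_NODELIST:-}" ]; then
        SRUN_ARGS+=(--nodelist="${SLURM_NODELIST}")
    fi
}
# Usage: build_srun_args; srun "${SRUN_ARGS[@]}" ...
```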
```diff
 ARGS=("$@")

-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
```
This hard-codes an extra bind mount to /shared_aig/c4, which will break (or unintentionally create an empty host directory) on systems that don’t have that path. Consider making this conditional/configurable via an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_EXTRA_MOUNTS) rather than always mounting a cluster-specific path.
```diff
-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+# Optional extra volume mounts: set PRIMUS_EXTRA_MOUNTS to a string like:
+#   '-v /shared_aig/c4:/shared_aig/c4 -v /other/path:/other/path:ro'
+if [[ -n "${PRIMUS_EXTRA_MOUNTS:-}" ]]; then
+    # Intentional word splitting to allow multiple -v arguments.
+    VOLUME_ARGS+=(${PRIMUS_EXTRA_MOUNTS})
+elif [[ -d "/shared_aig/c4" ]]; then
+    # Backwards-compatible default: only mount /shared_aig/c4 if it exists.
+    VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
+fi
```
```bash
set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
```
The header comment says the default downloads 1 shard for testing, but NUM_SHARDS is set to default to 200. Please update either the comment or the default so they match (and avoid surprising users by downloading/processing 200 shards by default).
```diff
-NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
+NUM_SHARDS=${NUM_SHARDS:-1}      # Number of C4 shards to download (1-1024)
```
```python
ret = subprocess.run(
    ["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True
)
if ret.returncode != 0:
    log_error_and_exit("Building Emerging Optimizers failed.")
```
subprocess.run(..., check=True) will raise an exception on failure, so the subsequent if ret.returncode != 0: branch is redundant/unreachable. Either remove the returncode check, or set check=False and keep the explicit returncode handling (and ideally invoke pip via sys.executable -m pip to ensure the active Python environment is used).
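To make the point concrete: with `check=True` a nonzero exit raises `subprocess.CalledProcessError` before the `returncode` test ever runs. A hedged rework keeping the explicit branch (the `install_editable` wrapper is illustrative; the error message mirrors the snippet above):

```python
import subprocess
import sys

def install_editable(path: str) -> None:
    """Install a local package in editable mode via the active interpreter.

    Sketch only: `sys.executable -m pip` ensures the active environment's
    pip runs, and check=False makes the explicit returncode branch reachable.
    """
    ret = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--no-build-isolation", "-e", path],
        check=False,  # handle failure explicitly instead of raising
    )
    if ret.returncode != 0:
        raise SystemExit("Building Emerging Optimizers failed.")
```

The alternative is equally valid: keep `check=True`, drop the `if`, and wrap the call in `try/except subprocess.CalledProcessError`.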
```yaml
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
```
The Torch/JAX unit test jobs are entirely commented out, which effectively disables CI test execution for PRs and can allow regressions to merge undetected. If this is only needed temporarily, consider gating the jobs behind an if: condition (e.g., on a label/branch) rather than removing them from the workflow, or re-enable them before merging.
```diff
-# run-unittest-torch:
-#   env:
-#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
-#   needs: [code-lint]
-#   runs-on: [primus-lm-cicd-torch-j8knc]
-#   steps:
-#     - run: echo "🎉 Begin Primus-Turbo Checkout."
-#     - name: Set commit hash to env
-#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
-#     - name: Checkout Repo Primus-Turbo
-#       uses: actions/checkout@v4
-#       with:
-#         repository: AMD-AIG-AIMA/Primus-Turbo
-#         submodules: "recursive"
-#         path: Primus-Turbo
-#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
-#     - run: echo "Begin Primus-Turbo Install."
-#     - name: Install Primus-Turbo
-#       run: |
-#         mv Primus-Turbo /tmp/
-#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
-#         git config --global --add safe.directory /tmp/Primus-Turbo
-#         cd /tmp/Primus-Turbo
-#         start_time=$(date +%s)
-#         echo "✅ [Pip install requirements] started at: $(date)"
-#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
-#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
+run-unittest-torch:
+  env:
+    PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
+  needs: [code-lint]
+  runs-on: [primus-lm-cicd-torch-j8knc]
+  steps:
+    - run: echo "🎉 Begin Primus-Turbo Checkout."
+    - name: Set commit hash to env
+      run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
+    - name: Checkout Repo Primus-Turbo
+      uses: actions/checkout@v4
+      with:
+        repository: AMD-AIG-AIMA/Primus-Turbo
+        submodules: "recursive"
+        path: Primus-Turbo
+        ref: ${{ env.PRIMUS_TURBO_COMMIT }}
+    - run: echo "Begin Primus-Turbo Install."
+    - name: Install Primus-Turbo
+      run: |
+        mv Primus-Turbo /tmp/
+        echo "Primus-Turbo dir: /tmp/Primus-Turbo"
+        git config --global --add safe.directory /tmp/Primus-Turbo
+        cd /tmp/Primus-Turbo
+        start_time=$(date +%s)
+        echo "✅ [Pip install requirements] started at: $(date)"
+        mkdir -p ${PRIMUS_WORKDIR}/primus-cache
+        MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
```
```bash
export HF_TOKEN="your_hf_token"             # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
```
This script exports HF_TOKEN and WANDB_API_KEY as literal strings, which both encourages storing secrets in a committed file and also overwrites any values already present in the caller’s environment. Prefer reading these from the environment (and erroring if missing) rather than exporting placeholder values here.
```diff
-export HF_TOKEN="your_hf_token"             # make it your own hf token
-export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
+: "${HF_TOKEN:?Environment variable HF_TOKEN must be set}"
+: "${WANDB_API_KEY:?Environment variable WANDB_API_KEY must be set}"
+export HF_TOKEN
+export WANDB_API_KEY
```
```python
# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
world_size = torch.distributed.get_world_size()
gathered_usage = [torch.zeros_like(usage_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_usage, usage_tensor)

rocm_mem_usages = [t.item() for t in gathered_usage]
max_usage = max(rocm_mem_usages)
max_rank = rocm_mem_usages.index(max_usage)

rocm_mem_str = (
    f" | rocm mem usage/free/total/usage_ratio: "
    f"{r_used / 1024 ** 3:.2f}GB/"
    f"{r_free / 1024 ** 3:.2f}GB/"
    f"{r_total / 1024 ** 3:.2f}GB/"
    f"{r_ratio * 100:.2f}%"
    f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
    f"{max_usage / 1024 ** 3:.2f}GB/"
    f"{max_usage / r_total * 100:.2f}%"
```
max_usage / r_total uses the local rank’s r_total when computing the max-usage ratio, which can be incorrect if ranks have different total HBM sizes (or if totals differ for any reason). To make this accurate, gather each rank’s r_total (or gather precomputed r_used/r_total ratios) and compute the max ratio corresponding to max_rank.
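The fix can be sketched without a live process group: gather `(used, total)` pairs per rank, then select the max by ratio. Below, a plain list stands in for the `all_gather` result, and the sample byte counts are made up for illustration:

```python
# Sketch: compute the max-usage rank from per-rank (used_bytes, total_bytes)
# pairs, as would be produced by all_gathering a 2-element int64 tensor.
gathered = [
    (150 * 1024**3, 192 * 1024**3),  # rank 0: 150 GB used of 192 GB
    (170 * 1024**3, 192 * 1024**3),  # rank 1: 170 GB used of 192 GB
    (120 * 1024**3, 128 * 1024**3),  # rank 2: smaller GPU, highest ratio
]

ratios = [used / total for used, total in gathered]
max_rank = max(range(len(ratios)), key=ratios.__getitem__)
max_used, max_total = gathered[max_rank]

print(max_rank, f"{ratios[max_rank] * 100:.2f}%")  # -> 2 93.75%
```

Note rank 2 wins despite using the fewest bytes, which is exactly the case the local-`r_total` division gets wrong.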
```diff
 # parallel
 tensor_model_parallel_size: ${PRIMUS_TP:1}
-pipeline_model_parallel_size: ${PRIMUS_PP:1}
+pipeline_model_parallel_size: ${PRIMUS_PP:8}
```
This config changes the default pipeline_model_parallel_size from the common default of 1 to 8, which is inconsistent with the MI355X DeepSeek-V3 FP8 config and most other configs. To avoid surprising users (and breaking single-node runs), consider keeping the default at 1 and relying on PRIMUS_PP to opt into PP>1.
```diff
-pipeline_model_parallel_size: ${PRIMUS_PP:8}
+pipeline_model_parallel_size: ${PRIMUS_PP:1}
```
No description provided.